import pandas as pd
= "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/credit-risk/train.csv"
url = pd.read_csv(url) df_train
Abstract and Methodologies
Automated decision systems are increasingly used in financial institutions to assess credit risk and determine loan eligibility. In this blog post, we build upon the theoretical framework of binary decision-making with a linear score function, applying it to a more realistic credit-risk prediction scenario. Our goal is twofold: first, to develop a score function and threshold that optimize a bank’s total expected profit while considering various borrower features; and second, to assess how this decision system impacts different demographic segments. By leveraging data-driven modeling, visualization, and profit-based optimization, we aim to create a more informed and equitable approach to automated lending decisions.
Loading the data
df_train.head()
person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 25 | 43200 | RENT | NaN | VENTURE | B | 1200 | 9.91 | 0 | 0.03 | N | 4 |
1 | 27 | 98000 | RENT | 3.0 | EDUCATION | C | 11750 | 13.47 | 0 | 0.12 | Y | 6 |
2 | 22 | 36996 | RENT | 5.0 | EDUCATION | A | 10000 | 7.51 | 0 | 0.27 | N | 4 |
3 | 24 | 26000 | RENT | 2.0 | MEDICAL | C | 1325 | 12.87 | 1 | 0.05 | N | 4 |
4 | 29 | 53004 | MORTGAGE | 2.0 | HOMEIMPROVEMENT | A | 15000 | 9.63 | 0 | 0.28 | N | 10 |
Analysis of Loan Intent by Age Group
This bar chart below shows how different age groups use their loans for various purposes—such as venture, education, medical, home improvement, personal, and debt consolidation. We can see that borrowers aged 18–29 make up a large portion of total loans, often driven by education and personal loan needs. As age increases, the number of loans generally decreases, but certain categories—like debt consolidation—can become more common in older groups.
Overall, this chart highlights that younger borrowers borrow a lot more money may be more focused on educational or personal financing, while older borrowers might shift their attention to consolidating debt or improving their homes. Understanding these patterns help us in understanding different patternns in who would default on a loan and not.
import matplotlib.pyplot as plt
import seaborn as sns
# Load the data
= "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/credit-risk/train.csv"
url = pd.read_csv(url)
df
# Create additional features for analysis:
# 1. Age groups: we create bins to see how loan intent varies with age.
= [18, 30, 40, 50, 60, 100]
age_bins = ['18-29', '30-39', '40-49', '50-59', '60+']
age_labels 'age_group'] = pd.cut(df['person_age'], bins=age_bins, labels=age_labels)
df[
# ---------------------------
# Visualization 1:
# How does loan intent vary with age and home ownership status?
=(10, 6))
plt.figure(figsize=df, x='age_group', hue='loan_intent')
sns.countplot(data'Loan Intent Distribution by Age Group')
plt.title('Age Group')
plt.xlabel('Count')
plt.ylabel(='Loan Intent')
plt.legend(title
plt.tight_layout() plt.show()
Analysis of Average Loan Amount by Credit History Length
This bar chart shows how the average loan amount changes based on the number of years a borrower has had a credit history. In general, we see that some longer lengths of credit history are associated with higher average loan amounts than others, though the pattern isn’t strictly increasing or decreasing. This suggests that lenders may be willing to extend larger lines of credit to individuals with certain credit history profiles.
For our automated decision system, credit history length could be an important feature because it often reflects a borrower’s past experience with credit and repayment behavior. However, we must be mindful of fairness and potential biases: borrowers who are younger or newer to credit might be at a disadvantage if the model heavily weighs credit history length. Balancing profitability for the bank with equitable access to credit remains a key challenge in designing our scoring and thresholding methods.
# 2. Employment length groups: useful for exploring patterns with job experience.
= [0, 2, 5, 10, df['person_emp_length'].max()]
emp_bins = ['0-1 yrs', '2-4 yrs', '5-9 yrs', '10+ yrs']
emp_labels 'emp_length_group'] = pd.cut(df['person_emp_length'], bins=emp_bins, labels=emp_labels)
df[
# ---------------------------
# Visualization 2:
# Which segments are offered different interest rates? Compare distributions by home ownership.
=(10, 6))
plt.figure(figsize= df.groupby('cb_person_cred_hist_length')['loan_amnt'].mean().reset_index()
avg_loan =avg_loan, x='cb_person_cred_hist_length', y='loan_amnt')
sns.barplot(data'Average Loan Amount by Credit History Length')
plt.title('Credit History Length (Years)')
plt.xlabel('Average Loan Amount')
plt.ylabel(
plt.tight_layout() plt.show()
Analysis of Loan Amount vs. Loan % of Income by Home Ownership
In our scatter plot below, each dot represents a borrower, with the x-axis showing how large the loan is relative to their income (as a percentage) and the y-axis showing the absolute loan amount. The colors indicate different types of home ownership (RENT, MORTGAGE, OWN, OTHER).
As you can see, the data points overlap heavily, making the chart look cluttered. To get a clearer picture, I wo;; split these data into separate graphs for each home ownership category. This will help us see more nuanced patterns—like whether renters tend to have higher loan-to-income ratios compared to those who own or have a mortgage.
'loan_percent_income_pct'] = df['loan_percent_income'] * 100
df[=(10, 6))
plt.figure(figsize
sns.scatterplot(=df,
data='loan_percent_income_pct',
x='loan_amnt',
y='person_home_ownership',
hue=0.7
alpha
)'Loan Amount vs. Loan % of Income by Home Ownership')
plt.title('Loan as % of Person Income')
plt.xlabel('Loan Amount')
plt.ylabel(0, 100) # Focus on 0–100% if most loans fall in this range
plt.xlim(
plt.tight_layout() plt.show()
Analysis of Loan Amount vs. Loan % of Income by Home Ownership (Separated Plots)
By splitting the data into four subplots (one for each home ownership category), we can see that:
Renters typically have loan amounts averaging around $10,000–$12,000, with a wide spread of loan-to-income ratios (averaging around 15–20%). In addition to this we see the most clustering for this group which means, in our dataset most of the people taking our loans are from this group.
Mortgage holders often take out larger loans (averaging $16,000–$18,000) but may have lower loan-to-income ratios (closer to 10% on average).
Owners (those who fully own their homes) tend to borrow moderate amounts ($12,000–$15,000) at ratios of around 12–15%.
Others (less common categories) show a broad mix but generally fall between these ranges.
These distinctions are important for our automated decision system, since each home ownership group presents a different risk and borrowing profile. When designing a score function and threshold to maximize the bank’s profit, it’s important for us to consider whether certain groups (like renters) might be unfairly penalized if they tend to have higher loan-to-income ratios. Ultimately, these separate plots help us fine-tune our model so that we balance profitability with equitable access to credit across different segments of borrowers.
# Create a FacetGrid: one subplot per home ownership category
= sns.FacetGrid(df, col="person_home_ownership", col_wrap=2, height=4)
g map(sns.scatterplot, "loan_percent_income_pct", "loan_amnt", alpha=0.7)
g.
# Set the x-axis limits and labels for clarity
set(xlim=(0, 100))
g."Loan as % of Income", "Loan Amount")
g.set_axis_labels(
plt.tight_layout() plt.show()
Summary of Loan Intent and Home Ownership Segments
This table shows how different combinations of loan intent (e.g., EDUCATION, MEDICAL, PERSONAL) and home ownership (MORTGAGE, OWN, RENT, OTHER) compare in terms of average interest rate, average loan amount, and count of borrowers. We’ve sorted the table by average loan amount in descending order to identify which segments receive the largest lines of credit.
Highest Averages: Segments like PERSONAL–OTHER and MEDICAL–OTHER appear near the top, suggesting they receive higher loan amounts (over $12,000 on average), but also tend to have higher interest rates (11–12%).
Mortgage vs. Rent: Many MORTGAGE segments (e.g., DEBTCONSOLIDATION–MORTGAGE, EDUCATION–MORTGAGE) cluster in the middle, with average loan amounts around $10,000–$11,000 and interest rates near 10–10.6%. Renters often see slightly higher interest rates (11–12%) and somewhat lower loan amounts (around $8,000–$9,000).
Low Counts: Some segments have very few borrowers (like DEBTCONSOLIDATION–OWN with a count of only 62), which may not be reliable for broad conclusions.
From the perspective of building an automated decision system, these patterns hint at where the bank’s profit opportunities and risks might lie. For instance, segments with higher average loan amounts but also higher interest rates could be more profitable—but might also carry greater default risk. Tracking how many borrowers fall into each segment (the “count” column) helps ensure the model doesn’t overly focus on small, potentially unrepresentative groups.
= (
summary_table
df'loan_intent', 'person_home_ownership'], as_index=False)
.groupby([
.agg(=('loan_int_rate', 'mean'),
avg_interest_rate=('loan_amnt', 'mean'),
avg_loan_amount=('loan_amnt', 'count') # how many borrowers in each segment
count
)
)
# Sort by average loan amount (descending) to see which segments get the largest lines of credit
= summary_table.sort_values('avg_loan_amount', ascending=False)
summary_table_sorted_by_amount
summary_table_sorted_by_amount
loan_intent | person_home_ownership | avg_interest_rate | avg_loan_amount | count | |
---|---|---|---|---|---|
17 | PERSONAL | OTHER | 11.675714 | 12366.666667 | 15 |
13 | MEDICAL | OTHER | 12.745000 | 12200.000000 | 13 |
5 | EDUCATION | OTHER | 12.400833 | 12142.857143 | 14 |
9 | HOMEIMPROVEMENT | OTHER | 11.683000 | 10959.090909 | 11 |
8 | HOMEIMPROVEMENT | MORTGAGE | 10.613916 | 10764.017341 | 1384 |
20 | VENTURE | MORTGAGE | 10.468000 | 10606.281060 | 1811 |
0 | DEBTCONSOLIDATION | MORTGAGE | 10.400489 | 10588.756111 | 1841 |
4 | EDUCATION | MORTGAGE | 10.554563 | 10502.178076 | 2089 |
12 | MEDICAL | MORTGAGE | 10.505553 | 10485.867052 | 1730 |
16 | PERSONAL | MORTGAGE | 10.426285 | 10481.223233 | 1868 |
21 | VENTURE | OTHER | 12.274211 | 10367.500000 | 20 |
11 | HOMEIMPROVEMENT | RENT | 11.812287 | 10109.604633 | 1252 |
1 | DEBTCONSOLIDATION | OTHER | 11.566667 | 9783.333333 | 15 |
14 | MEDICAL | OWN | 10.749097 | 9367.755682 | 352 |
10 | HOMEIMPROVEMENT | OWN | 10.922405 | 9242.450980 | 255 |
6 | EDUCATION | OWN | 10.797507 | 8996.177184 | 412 |
18 | PERSONAL | OWN | 10.867262 | 8944.209040 | 354 |
3 | DEBTCONSOLIDATION | RENT | 11.358223 | 8882.754425 | 2260 |
19 | PERSONAL | RENT | 11.535415 | 8826.900046 | 2171 |
23 | VENTURE | RENT | 11.438455 | 8804.566745 | 2135 |
22 | VENTURE | OWN | 10.560000 | 8789.621914 | 648 |
7 | EDUCATION | RENT | 11.315216 | 8685.308193 | 2612 |
15 | MEDICAL | RENT | 11.422580 | 8426.925182 | 2740 |
2 | DEBTCONSOLIDATION | OWN | 14.432909 | 7749.193548 | 62 |
Training and evaluating our Logistic Regression model
I used a logistic regression model to predict whether a prospective borrower will default on a loan. After preprocessing the data—by standardizing numerical features and one-hot encoding categorical variables. I removed rows with missing values, and split the dataset into training and test sets. The model achieved a test accuracy of about 84.2%.
The confusion matrix provides additional insight into the model’s performance:
True Negatives (TN): 3,393 borrowers who did not default and were correctly predicted as non-default.
False Positives (FP): 221 borrowers who did not default but were incorrectly flagged as defaults.
False Negatives (FN): 501 borrowers who defaulted but were missed by the model.
True Positives (TP): 467 borrowers who defaulted and were correctly identified.
Our results suggest that while the model performs reasonably well overall, there is still a balance to be struck between avoiding false positives and false negatives. This is particularly important when designing an automated decision system for credit risk, because both profitability for the bank and equitable access to credit are critical. Further tuning of the threshold and exploration of additional features could help optimize the model even further for its intended purpose
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
# Drop rows with missing values
= df_train.dropna(subset=numeric_features + categorical_features)
df_train
= 'loan_status'
target = df_train.drop(columns=[target])
X = df_train[target]
y
= ['person_age', 'person_income', 'person_emp_length', 'loan_amnt',
numeric_features 'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length']
= ['person_home_ownership', 'loan_intent', 'cb_person_default_on_file']
categorical_features
= ColumnTransformer(
preprocessor =[
transformers'num', StandardScaler(), numeric_features),
('cat', OneHotEncoder(drop='first'), categorical_features)
(
])= preprocessor.fit_transform(X)
X_transformed
# Split the transformed data into training and test sets
= train_test_split(
X_train, X_test, y_train, y_test =0.2, random_state=123
X_transformed, y, test_size
)
# Fit a logistic regression model using the preprocessed features
= LogisticRegression(max_iter=1000)
model
model.fit(X_train, y_train)
# Make predictions on the test set
= model.predict(X_test)
y_pred = accuracy_score(y_test, y_pred)
accuracy = confusion_matrix(y_test, y_pred)
conf_matrix
print("Test Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
Test Accuracy: 0.8424268878219118
Confusion Matrix:
[[3393 221]
[ 501 467]]
Finding weight and threshold
'loan_int_rate_decimal'] = df['loan_int_rate'] / 100.0
df[
def profit_if_repaid(loan_amnt, loan_int_rate_decimal):
"""
If the loan is repaid in full, the bank's profit is:
loan_amnt * (1 + 0.25*loan_int_rate)^10 - loan_amnt
"""
return loan_amnt * (1 + 0.25 * loan_int_rate) ** 10 - loan_amnt
def profit_if_default(loan_amnt, loan_int_rate_decimal):
"""
If the borrower defaults, we assume default happens 3 years into the loan,
and the bank loses 70% of the principal:
loan_amnt*(1 + 0.25*loan_int_rate)^3 - 1.7*loan_amnt
"""
return loan_amnt * (1 + 0.25 * loan_int_rate) ** 3 - 1.7 * loan_amnt
We will now compute predicted probabilities (probability of default) for the training data
= model.predict_proba(X_train)[:, 1] # column 1 = probability of default
y_prob_train
# Split the transformed data into training and test sets
= train_test_split(
X_train, X_test, y_train, y_test =0.2, random_state=123
X, y, test_size
)= X_train['loan_amnt'].to_numpy()
loan_amnt_array = X_train['loan_int_rate_decimal'].to_numpy()
loan_int_rate_array = y_train.to_numpy()
y_train_array = profit_if_repaid(loan_amnt_array, loan_int_rate_array)
profit_repaid = profit_if_default(loan_amnt_array, loan_int_rate_array) profit_default
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) File ~/anaconda3/envs/ml-0451/lib/python3.11/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key) 3804 try: -> 3805 return self._engine.get_loc(casted_key) 3806 except KeyError as err: File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc() File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc() File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item() File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'loan_int_rate_decimal' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) /Users/prashanthbabu/Desktop/machine_learning/csci-0451/posts/new-new-test-post/index.ipynb Cell 22 line 6 <a href='vscode-notebook-cell:/Users/prashanthbabu/Desktop/machine_learning/csci-0451/posts/new-new-test-post/index.ipynb#X42sZmlsZQ%3D%3D?line=1'>2</a> X_train, X_test, y_train, y_test = train_test_split( <a href='vscode-notebook-cell:/Users/prashanthbabu/Desktop/machine_learning/csci-0451/posts/new-new-test-post/index.ipynb#X42sZmlsZQ%3D%3D?line=2'>3</a> X, y, test_size=0.2, random_state=123 <a href='vscode-notebook-cell:/Users/prashanthbabu/Desktop/machine_learning/csci-0451/posts/new-new-test-post/index.ipynb#X42sZmlsZQ%3D%3D?line=3'>4</a> ) <a href='vscode-notebook-cell:/Users/prashanthbabu/Desktop/machine_learning/csci-0451/posts/new-new-test-post/index.ipynb#X42sZmlsZQ%3D%3D?line=4'>5</a> loan_amnt_array = X_train['loan_amnt'].to_numpy() ----> <a href='vscode-notebook-cell:/Users/prashanthbabu/Desktop/machine_learning/csci-0451/posts/new-new-test-post/index.ipynb#X42sZmlsZQ%3D%3D?line=5'>6</a> loan_int_rate_array = X_train['loan_int_rate_decimal'].to_numpy() <a href='vscode-notebook-cell:/Users/prashanthbabu/Desktop/machine_learning/csci-0451/posts/new-new-test-post/index.ipynb#X42sZmlsZQ%3D%3D?line=6'>7</a> y_train_array = y_train.to_numpy() <a href='vscode-notebook-cell:/Users/prashanthbabu/Desktop/machine_learning/csci-0451/posts/new-new-test-post/index.ipynb#X42sZmlsZQ%3D%3D?line=7'>8</a> profit_repaid = profit_if_repaid(loan_amnt_array, loan_int_rate_array) File ~/anaconda3/envs/ml-0451/lib/python3.11/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key) 4100 if self.columns.nlevels > 1: 4101 return self._getitem_multilevel(key) -> 4102 indexer = self.columns.get_loc(key) 4103 if is_integer(indexer): 4104 indexer = [indexer] File ~/anaconda3/envs/ml-0451/lib/python3.11/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key) 3807 if isinstance(casted_key, slice) or ( 3808 isinstance(casted_key, abc.Iterable) 3809 and any(isinstance(x, slice) for x in casted_key) 3810 ): 3811 raise InvalidIndexError(key) -> 3812 raise KeyError(key) from err 3813 except TypeError: 3814 # If we have a listlike key, _check_indexing_error will raise 3815 # InvalidIndexError. Otherwise we fall through and re-raise 3816 # the TypeError. 3817 self._check_indexing_error(key) KeyError: 'loan_int_rate_decimal'
Now we are going to find a threshold and identify teh best threshold
# 5. Sweep over thresholds to find the one that maximizes average profit
= np.linspace(0, 1, 101)
thresholds = []
avg_profits
for t in thresholds:
# Predict default if probability >= t
= (y_prob_train >= t).astype(int)
predicted_default # If we predict default, we do NOT give the loan => profit = 0
# If we predict no default, we DO give the loan => actual profit depends on y_train_array
# - If actual y=0 (no default), profit = profit_repaid
# - If actual y=1 (default), profit = profit_default
= 1 - predicted_default # 1 = give loan, 0 = no loan
give_loan # total_profit[i] = give_loan[i] * [ (1 - y[i])*profit_repaid[i] + y[i]*profit_default[i] ]
= give_loan * ((1 - y_train_array) * profit_repaid + y_train_array * profit_default)
total_profit
# Compute average profit per borrower
= total_profit.mean()
avg_profit
avg_profits.append(avg_profit)
= np.array(avg_profits)
avg_profits
# 6. Identify the best threshold
= np.argmax(avg_profits) # index of the best threshold
best_idx = thresholds[best_idx]
best_threshold
print(f"Best Threshold: {best_threshold:.3f}")
Best Threshold: 1.000
# 7. Plot profit vs. threshold
=(8, 5))
plt.figure(figsize='Profit per Borrower')
plt.plot(thresholds, avg_profits, label='red', zorder=10, label='Optimal Threshold')
plt.scatter(best_threshold, best_profit, color'Profit per Borrower (Training Set) vs. Threshold')
plt.title('Threshold (Probability of Default)')
plt.xlabel('Average Profit per Borrower')
plt.ylabel(
plt.legend()
plt.tight_layout() plt.show()